[PERF] Optimize LingBot-World-Fast chunk profiling and runtime paths #3

Merged: lzx1413 merged 8 commits into Tele-AI:main from yJader:feature/lingbot-world-fast-profiling on Jun 30, 2026

yJader (Contributor) commented Jun 29, 2026

Description

This PR adds finer-grained LingBot-World-Fast profiling controls and improves chunk generation performance by reducing avoidable layout conversions, removing eager Triton LayerNorm wrapper overhead, and avoiding CUDA scalar index synchronizations in the self-attention KV cache. It also exposes local attention window settings through the pipeline config so the runtime can size and use the self-KV cache according to the requested window.

Motivation

Profiling showed several costs that were either hard to attribute or avoidable in the current LingBot-World-Fast runtime:

  • Torch profiler defaults (record_shapes, profile_memory, with_stack) can add high overhead and distort pipeline-level traces.
  • VAE CausalConv3d and DiT patch embedding receive NCDHW inputs while cuDNN Conv3d kernels prefer channels_last_3d; on current PyTorch/cuDNN this does not fall back to slow_conv_dilated3d, but it still triggers repeated implicit NCHW/NHWC layout transforms.
  • The eager Triton LayerNorm path spends most of its apparent profiler time in Python/HOP/autotuner wrapper overhead rather than GPU compute.
  • KV cache index tensors require .item() reads from CUDA tensors, introducing device-to-host synchronization.
  • Local attention and sink-size settings need to flow from pipeline config into DiT/runtime cache sizing for profiling and generation experiments.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Code refactoring
  • Documentation update
  • Other (please describe):

Changes Made

  • Profiling controls and trace ranges

    • Add env-driven torch profiler options:
      • TELEFUSER_TORCH_PROFILER_RECORD_SHAPES
      • TELEFUSER_TORCH_PROFILER_PROFILE_MEMORY
      • TELEFUSER_TORCH_PROFILER_WITH_STACK
    • Keep historical profiler defaults when env vars are unset or invalid.
    • Add ProfilingContext4Debug ranges for workloop, create_runtime, generate_next_chunk, denoise_chunk, kv_cache_update_forward, and vae_decode.
  • LingBot-World-Fast local attention configuration

    • Add local_attn_size and sink_size to LingBotWorldFastPipelineConfig.
    • Pass those options into LingBotWorldFastDiT.from_pretrained.
    • Size the self-KV cache from the local attention window when local_attn_size > -1.
  • Conv3d layout optimization

    • Convert DiT patch embedding input to torch.channels_last_3d before self.patch_embedding.
    • Convert WanVideoVAE.CausalConv3d input to torch.channels_last_3d before nn.Conv3d.forward.
    • This change does not depend on avoiding slow_conv_dilated3d in the current environment. With torch 2.12.1+cu130 and cuDNN 9.20, the baseline already uses cuDNN Conv3d. The gain comes from trading one explicit DtoD layout copy for far fewer implicit cuDNN NCHW/NHWC transforms.
  • LayerNorm eager path

    • Route LayerNorm.forward_cuda to the native PyTorch implementation in eager mode.
    • This removes the profiler false hotspot caused by torch.library.wrap_triton / HOP / Triton autotuner wrapper cost on small LayerNorm kernels.
  • KV cache index synchronization

    • Store global_end_index and local_end_index as host Python int values instead of CUDA tensors.
    • Keep a compatibility helper for existing tensor values, but update the runtime path to write ints.
    • This avoids .item()-driven DtoH syncs in CausalSelfAttention.forward.
  • Packaging/import robustness

    • Add a fallback __version__ = "0.0.0+unknown" when telefuser._version is absent in a source checkout.
  • Tests added

    • tests/unit/utils/test_profiler_flags.py
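
As a rough illustration of the cache-sizing rule listed under "LingBot-World-Fast local attention configuration" (size the self-KV cache from the window when local_attn_size > -1), a minimal sketch follows. The function names, the `total_frames` and `tokens_per_frame` parameters, and the assumption that the window is counted in latent frames and already includes the sink frames are all hypothetical; the PR's actual sizing code may differ.

```python
def self_kv_cache_frames(local_attn_size: int, total_frames: int) -> int:
    """Latent frames the self-attention KV cache needs to hold.

    Assumption for this sketch: local_attn_size is measured in latent frames
    and -1 means "no local window, attend over everything".
    """
    if local_attn_size > -1:
        return local_attn_size  # cache only the requested window
    return total_frames  # global attention: cache every frame


def self_kv_cache_tokens(
    local_attn_size: int, total_frames: int, tokens_per_frame: int
) -> int:
    """Convert the frame budget into a token budget for cache allocation."""
    return self_kv_cache_frames(local_attn_size, total_frames) * tokens_per_frame
```

With a window of 21 frames the cache no longer grows with the length of the generated sequence, which is the property the runtime relies on when the window is set through the pipeline config.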

Testing

  • Targeted unit tests pass
  • Manual testing performed
  • Benchmarks added/updated (if applicable)
python -m pytest -q tests/unit/utils/test_profiler_flags.py

Result: passed, 2 tests.
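
For readers who want a concrete picture of the behaviour these tests cover (fall back to the existing default when a variable is unset or invalid), here is a minimal sketch. The helper names `env_flag` and `profiler_kwargs` and the accepted spellings are assumptions made for this example, not the PR's actual implementation.

```python
import os

# Accepted spellings are an assumption for this sketch, not taken from the PR.
_TRUE = {"1", "true", "yes", "on"}
_FALSE = {"0", "false", "no", "off"}


def env_flag(name: str, default: bool) -> bool:
    """Read a boolean env var; keep `default` when it is unset or invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE:
        return True
    if value in _FALSE:
        return False
    return default  # invalid value: keep the historical default


def profiler_kwargs(defaults: dict) -> dict:
    """Resolve TELEFUSER_TORCH_PROFILER_<KEY> for each key in `defaults`."""
    return {
        key: env_flag(f"TELEFUSER_TORCH_PROFILER_{key.upper()}", default)
        for key, default in defaults.items()
    }
```

A caller would pass its own defaults, for example `profiler_kwargs({"record_shapes": True, "profile_memory": True, "with_stack": True})`, and forward the result to the torch profiler. The default values shown here are placeholders.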

Checklist

  • Code follows the project's coding standards (ruff)
  • Pre-commit hooks pass (pre-commit run --all-files)
  • All tests pass (pytest tests/)
  • New tests added for new functionality
  • Documentation updated (README, CLAUDE.md, docstrings)
  • Commit messages are clear and descriptive
  • PR title follows the convention: [TYPE] Brief description

Related Issues

N/A

Additional Notes

  • Earlier analysis identified an old-environment Conv3d issue where PyTorch 2.9.1 + cuDNN 9.10 could route bf16/fp16 5D Conv3d to aten::slow_conv_dilated3d. The current benchmark environment is different: torch 2.12.1+cu130, CUDA 13.0, cuDNN 9.20, H100. In this environment, the baseline no longer goes through slow_conv_dilated3d; the observed VAE win is from reducing implicit layout transforms.

GPU Architecture Support

  • SM8x (Ampere, Ada Lovelace)
  • SM90 (Hopper H100)
  • SM100+ (Blackwell)

No new custom CUDA/Triton kernels are added. Runtime measurements in this draft were collected on NVIDIA H100 GPUs. The code changes use PyTorch memory-format and native operator paths, so there is no new architecture-specific kernel support matrix to validate.

Performance Impact

Primary no-profiler benchmark:

  • Config:
    • case 03
    • frame_num=201
    • chunk_size=3
    • local_attn_size=21
    • sink_size=3
    • max_area=399360
    • --no-write-video
    • CUDA timing sync enabled
    • summary skips 1 warmup chunk and reports 16 steady-state chunks
  • Environment:
    • torch 2.12.1+cu130
    • CUDA 13.0
    • cuDNN 9.20
    • NVIDIA H100
| Metric | Baseline | Modified | Delta | Relative |
| --- | --- | --- | --- | --- |
| generate_next_chunk_seconds.mean | 2.932498 s | 2.862083 s | -0.070415 s/chunk | -2.40% |
| denoise_seconds.mean | 2.043713 s | 2.036162 s | -0.007551 s/chunk | -0.37% |
| update_cache_seconds.mean | 0.500482 s | 0.498508 s | -0.001974 s/chunk | -0.39% |
| decode_seconds.mean | 0.388245 s | 0.327375 s | -0.060869 s/chunk | -15.68% |
| total_seconds.mean | 2.932894 s | 2.862172 s | -0.070722 s/chunk | -2.41% |

At this resolution/config, decode accounts for roughly 86% of the steady-state generate_next_chunk improvement.
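
The 86% figure can be reproduced from the reported stage means. The short script below is only a worked check of that arithmetic; the dictionary layout and function names are ad hoc for this example, and the numbers are copied from the no-profiler benchmark results.

```python
# Steady-state stage means in seconds per chunk, copied from the benchmark.
baseline = {
    "generate_next_chunk": 2.932498,
    "denoise": 2.043713,
    "update_cache": 0.500482,
    "decode": 0.388245,
}
modified = {
    "generate_next_chunk": 2.862083,
    "denoise": 2.036162,
    "update_cache": 0.498508,
    "decode": 0.327375,
}


def delta(stage: str) -> float:
    """Absolute change in seconds per chunk (negative means faster)."""
    return modified[stage] - baseline[stage]


def relative(stage: str) -> float:
    """Change as a fraction of the baseline value."""
    return delta(stage) / baseline[stage]


# Share of the generate_next_chunk improvement that comes from decode.
decode_share = delta("decode") / delta("generate_next_chunk")
```

Here `decode_share` evaluates to about 0.864, and `relative("decode")` to about -0.157, matching the -15.68% reported for decode.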

Profiler trace attribution:

  • Config:
    • case 03
    • frame_num=89
    • chunk_size=3
    • local_attn_size=21
    • sink_size=3
    • max_area=99840
    • profiler enabled for create_runtime,generate_next_chunk
  • Important caveat: profiler-enabled total_seconds is dominated by profiler overhead. Use stage timing and GPU kernel attribution, not end-to-end profiler wall time.

Profiler timing summary:

| Metric | Baseline | Modified | Delta | Relative |
| --- | --- | --- | --- | --- |
| generate_next_chunk_seconds.mean | 1.024707 s | 0.625288 s | -0.399418 s/chunk | -38.98% |
| denoise_seconds.mean | 0.745963 s | 0.436910 s | -0.309053 s/chunk | -41.43% |
| update_cache_seconds.mean | 0.165109 s | 0.094973 s | -0.070136 s/chunk | -42.48% |
| decode_seconds.mean | 0.108267 s | 0.090073 s | -0.018194 s/chunk | -16.80% |

DiT GPU attribution from analyze_telefuser_dit_profile.py:

| Trace | Chosen GPU pid | Raw GPU time | Clean GPU time |
| --- | --- | --- | --- |
| Baseline | 1 | 460.566 ms | 448.443 ms |
| Modified | 1 | 472.059 ms | 459.920 ms |

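The DiT GPU kernel time is not the source of the current speedup in this trace: the modified trace shows slightly more raw GPU time (472.059 ms vs 460.566 ms, about +2.5%). The stage-level gains are therefore consistent with the host-side wrapper and synchronization reductions described under Changes Made.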

VAE/GPU0 copy-layout attribution:

| Metric | Baseline | Modified | Delta |
| --- | --- | --- | --- |
| GPU0 total kernel time | 86.184 ms | 74.733 ms | -11.451 ms (-13.29%) |
| layout/copy family, excluding host copies | ~27.16 ms | ~16.36 ms | ~-10.8 ms |
| torch_direct_copy_kernel | 14.641 ms / 434 launches | 8.782 ms / 263 launches | -5.859 ms |
| cudnn_nchw_to_nhwc | 7.561 ms / 207 launches | 0.351 ms / 12 launches | -7.210 ms |
| cudnn_nhwc_to_nchw | 3.392 ms / 105 launches | 0.158 ms / 6 launches | -3.234 ms |
| Memcpy DtoD | 0.447 ms / 70 launches | 5.950 ms / 350 launches | +5.503 ms |

Interpretation: the explicit x.contiguous(memory_format=torch.channels_last_3d) increases visible Memcpy DtoD, but it removes more implicit cuDNN NCHW/NHWC transforms and direct-copy work. This is why the modified branch can show higher Memcpy DtoD while still reducing total VAE decode time.
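
To make the layout change concrete, the sketch below converts a 5D NCDHW tensor to channels_last_3d before a Conv3d call and checks that only the strides change, not the shape or the result. It runs on CPU with toy sizes, so it shows the mechanics rather than the cuDNN speedup; the helper name `to_channels_last_3d` and the tensor sizes are assumptions for this example.

```python
import torch


def to_channels_last_3d(x: torch.Tensor) -> torch.Tensor:
    """Return a 5D NCDHW tensor whose memory is laid out as NDHWC."""
    # One explicit layout copy, made before the convolution is called.
    return x.contiguous(memory_format=torch.channels_last_3d)


torch.manual_seed(0)
conv = torch.nn.Conv3d(4, 8, kernel_size=3, padding=1)
x = torch.randn(2, 4, 5, 16, 16)  # logical shape: N, C, D, H, W

x_cl = to_channels_last_3d(x)

with torch.no_grad():
    y_ref = conv(x)    # default contiguous (NCDHW) input
    y_cl = conv(x_cl)  # same values, channels_last_3d strides
```

The logical shape stays NCDHW, so callers that index by dimension are unaffected; only the stride of the channel dimension becomes 1.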

Supplemental Trace Figures

The following profiler screenshots are included as supplementary evidence. The numeric benchmark tables above remain the source of truth for this PR.

Historical VAE Conv3d trace from the earlier PyTorch/cuDNN environment. This explains why the channels_last_3d change was originally investigated. It should not be read as the current torch 2.12.1+cu130 behavior, where baseline no longer falls back to slow_conv_dilated3d.

image

Profiler-side generate_next_chunk / VAE timing view, showing the stage-level before/after context that motivated the VAE decode attribution.

image

LayerNorm profiler hotspot. The linked GPU kernel is only tens of microseconds, while the visible range is dominated by eager wrap_triton / HOP / autotuner wrapper cost.

image

LayerNorm fix context: route eager execution to the native PyTorch implementation instead of the small Triton wrapper path.

image

KV cache index trace showing DtoH synchronization from reading CUDA scalar index tensors with .item(). The PR changes those indices to Python int values in the runtime path.

image
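
The index change can be pictured with the following sketch, which keeps both end indices as host Python ints and retains a small compatibility helper for legacy scalar tensors. The class name, the `advance` method, and its clamping rule are hypothetical and exist only to illustrate the idea; they are not the PR's cache implementation.

```python
from dataclasses import dataclass


def as_host_int(value) -> int:
    """Accept a Python int or a legacy scalar tensor-like object."""
    if isinstance(value, int):
        return value
    # Legacy path: .item() on a CUDA tensor forces a device-to-host sync.
    return int(value.item())


@dataclass
class SelfKVCacheIndex:
    """End indices tracked on the host, so reading them never touches the GPU."""

    global_end_index: int = 0
    local_end_index: int = 0

    def advance(self, num_new_tokens: int, cache_capacity: int) -> None:
        # Hypothetical update rule: the global index grows without bound,
        # while the local index is clamped to the cache capacity.
        self.global_end_index += num_new_tokens
        self.local_end_index = min(
            self.local_end_index + num_new_tokens, cache_capacity
        )
```

Because the indices are plain ints, slicing the cache with them in the attention forward pass needs no `.item()` call and therefore no stream synchronization.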

lzx1413 (Collaborator) commented Jun 30, 2026

LGTM

lzx1413 merged commit 84e3f10 into Tele-AI:main on Jun 30, 2026 (5 checks passed)